This report contains information regarding the analysis pipeline for Alexey Fedulov’s rrbs analysis’
Overview: 22 samples of Sus scrofa (pig) were received as DNA. Of the 22 samples, samples 9, 13, 17, and 20 were not sequenced, thereby leaving 18 samples. The samples were subjected to enzymatic digestion (MSP1 + TaqI), library preparation, bisulfite conversion, Bioanalyzer QC, KAPA library quantification, multiplex NGS on an Illumina HiSeq4000 and advanced data analysis.
The following steps were performed in the analysis pipeline:
Initial QC of raw reads was run using FASTQC (Andrews S. (2010)).
Reads were trimmed using Trim Galore and Trimmomaic.
Bismark was used to prepare the reference genome and to align trimmed reads.
Bismark methylation extractor was used to extract methylation information.
The edgeR package was used to find differentially methylated loci (Chen et al. 2018).
Initial QC analyses of raw reads was performed in FastQC (V0.11.5). To remove overrepresented sequences, adapters, and low-quality sequences, trimming was performed via a two-step process. First, trimming was performed using trim galore with the following parameters:
trim_galore --quality 20 --clip_R1 6 --adapter AGATCGGAAGAGC --output_dir ${dir} --rrbs $sample
Then on the trimmed reads, trimming was once again performed in trimmomatic using the following parameters:
- trimmomatic:
subcommand: SE
options:
" ILLUMINACLIP:/gpfs/data/cbc/cbc_conda_v1/envs/cbc_conda/opt/trimmomatic-0.36/adapters/TruSeq3-SE.fa:2:30:5:6:true":
"SLIDINGWINDOW:4:20 MINLEN:35":
| Library | Raw reads | Trimmed reads |
|---|---|---|
| Rh1_R1_001 | ||
| Rh2_R1_001 | ||
| Rh3_R1_001 | ||
| Rh4_R1_001 | ||
| Rh5_R1_001 | ||
| Rh6_R1_001 | ||
| Rh7_R1_001 | ||
| Rh8_R1_001 | ||
| Rh10_R1_001 | ||
| Rh11_R1_001 | ||
| Rh12_R1_001 | ||
| Rh14_R1_001 | ||
| Rh15_R1_001 | ||
| Rh16_R1_001 | ||
| Rh18_R1_001 | ||
| Rh19_R1_001 | ||
| Rh21_R1_001 | ||
| Rh22_R1_001 |
| Library | Raw reads | Trimmed reads |
|---|---|---|
| Rh1_R1_001 | ||
| Rh2_R1_001 | ||
| Rh3_R1_001 | ||
| Rh4_R1_001 | ||
| Rh5_R1_001 | ||
| Rh6_R1_001 | ||
| Rh7_R1_001 | ||
| Rh8_R1_001 | ||
| Rh10_R1_001 | ||
| Rh11_R1_001 | ||
| Rh12_R1_001 | ||
| Rh14_R1_001 | ||
| Rh15_R1_001 | ||
| Rh16_R1_001 | ||
| Rh18_R1_001 | ||
| Rh19_R1_001 | ||
| Rh21_R1_001 | ||
| Rh22_R1_001 |
| Library | Raw reads | Trimmed reads |
|---|---|---|
| Rh1_R1_001 | ||
| Rh2_R1_001 | ||
| Rh3_R1_001 | ||
| Rh4_R1_001 | ||
| Rh5_R1_001 | ||
| Rh6_R1_001 | ||
| Rh7_R1_001 | ||
| Rh8_R1_001 | ||
| Rh10_R1_001 | ||
| Rh11_R1_001 | ||
| Rh12_R1_001 | ||
| Rh14_R1_001 | ||
| Rh15_R1_001 | ||
| Rh16_R1_001 | ||
| Rh18_R1_001 | ||
| Rh19_R1_001 | ||
| Rh21_R1_001 | ||
| Rh22_R1_001 |
--bowtie2) and mapping was done for PBAT libraries (--pbat) using the following parameters:bismark -o /gpfs/data/cbc/fedulov_alexey/porcine_rrbs/alignments/update --bowtie2 --genome /gpfs/data/shared/databases/refchef_refs/S_scrofa/primary/bismark_index/ --un --pbat $trimmed_read
| LIBRARY | NUMBER_READS | MAPPED_READS | UNMAPPED_READS | DUPLICATED_READS | DUPLICATION_RATE |
|---|---|---|---|---|---|
| Rh1_R1_001 | 10,845,489 | 10,845,489 / 100% | 0 / 0% | 1,437,581 / 13.26% | 12.6% |
| Rh2_R1_001 | 12,892,543 | 12,892,543 / 100% | 0 / 0% | 1,496,670 / 11.61% | 10.93% |
| Rh3_R1_001 | 8,715,481 | 8,715,481 / 100% | 0 / 0% | 1,029,908 / 11.82% | 11.02% |
| Rh4_R1_001 | 9,968,480 | 9,968,480 / 100% | 0 / 0% | 1,209,925 / 12.14% | 11.51% |
| Rh5_R1_001 | 13,925,551 | 13,925,551 / 100% | 0 / 0% | 2,056,379 / 14.77% | 13.91% |
| Rh6_R1_001 | 13,362,404 | 13,362,404 / 100% | 0 / 0% | 1,750,619 / 13.1% | 12.31% |
| Rh7_R1_001 | 12,851,559 | 12,851,559 / 100% | 0 / 0% | 1,713,430 / 13.33% | 12.48% |
| Rh8_R1_001 | 12,872,887 | 12,872,887 / 100% | 0 / 0% | 1,769,849 / 13.75% | 12.93% |
| Rh10_R1_001 | 15,060,158 | 15,060,158 / 100% | 0 / 0% | 2,107,684 / 14% | 13.11% |
| Rh11_R1_001 | 13,432,863 | 13,432,863 / 100% | 0 / 0% | 1,930,952 / 14.37% | 13.51% |
| Rh12_R1_001 | 14,880,989 | 14,880,989 / 100% | 0 / 0% | 2,381,186 / 16% | 14.75% |
| Rh14_R1_001 | 13,023,116 | 13,023,116 / 100% | 0 / 0% | 1,938,398 / 14.88% | 13.62% |
| Rh15_R1_001 | 13,292,288 | 13,292,288 / 100% | 0 / 0% | 1,992,503 / 14.99% | 13.72% |
| Rh16_R1_001 | 17,058,927 | 17,058,927 / 100% | 0 / 0% | 2,756,384 / 16.16% | 15.03% |
| Rh18_R1_001 | 12,196,440 | 12,196,440 / 100% | 0 / 0% | 1,814,764 / 14.88% | 13.42% |
| Rh19_R1_001 | 13,248,129 | 3,248,129 / 100% | 0 / 0% | 1,901,092 / 14.35% | 13.16% |
| Rh21_R1_001 | 16,030,293 | 16,030,293 / 100% | 0 / 0% | 2,624,856 / 16.37% | 14.84% |
| Rh22_R1_001 | 13,927,765 | 13,927,765 / 100% | 0 / 0% | 1,961,472 / 14.08% | 13.05% |
| LIBRARY | COVERAGE_MEAN | COVERAGE_SD | MEAN_MAP_QUALITY |
|---|---|---|---|
| Rh1_R1_001 | 0.1844 | 0.7664 | 28.42 |
| Rh2_R1_001 | 0.2191 | 0.8243 | 29.32 |
| Rh3_R1_001 | 0.1482 | 0.7405 | 27.77 |
| Rh4_R1_001 | 0.1694 | 0.7166 | 28.62 |
| Rh5_R1_001 | 0.2368 | 0.943 | 28.68 |
| Rh6_R1_001 | 0.2271 | 0.9149 | 29.01 |
| Rh7_R1_001 | 0.2185 | 0.8851 | 28.47 |
| Rh8_R1_001 | 0.2188 | 0.9475 | 29.15 |
| Rh10_R1_001 | 0.2561 | 1.0472 | 28.59 |
| Rh11_R1_001 | 0.2284 | 0.9772 | 28.47 |
| Rh12_R1_001 | 0.253 | 1.0263 | 29.06 |
| Rh14_R1_001 | 0.2214 | 0.9804 | 29.35 |
| Rh15_R1_001 | 0.226 | 0.9128 | 28.77 |
| Rh16_R1_001 | 0.29 | 1.2004 | 29.18 |
| Rh18_R1_001 | 0.2075 | 1.0798 | 28.59 |
| Rh19_R1_001 | 0.2252 | 0.9003 | 29.01 |
| Rh21_R1_001 | 0.2726 | 1.2379 | 29.02 |
| Rh22_R1_001 | 0.2368 | 1.1062 | 29.13 |
Bismark methylation extractor was used to extract methylation information using the following parameters:
bismark_methylation_extractor --bedGraph --comprehensive --ignore_3prime 6 -s --merge_non_CpG --report --output /gpfs/data/cbc/fedulov_alexey/porcine_rrbs/alignments/ --gzip --multicore 8 --genome_folder /gpfs/data/shared/databases/refchef_refs/S_scrofa/primary/bismark_index/ $sample
This produced m-bias plots, illustrating the methylation proportion across each possible position in the read (see Part 8 below).
| Library | M-bias plot |
|---|---|
| Rh1_R1_001 | |
| Rh2_R1_001 | |
| Rh3_R1_001 | |
| Rh4_R1_001 | |
| Rh5_R1_001 | |
| Rh6_R1_001 | |
| Rh7_R1_001 | |
| Rh8_R1_001 | |
| Rh10_R1_001 | |
| Rh11_R1_001 | |
| Rh12_R1_001 | |
| Rh14_R1_001 | |
| Rh15_R1_001 | |
| Rh16_R1_001 | |
| Rh18_R1_001 | |
| Rh19_R1_001 | |
| Rh21_R1_001 | |
| Rh22_R1_001 |
Genes were filtered such that only CpGs with at least 5 counts (methylated and unmethylated) in 3 samples were included for downstream analysis. Additionally, CpGs that were never methylated or always methylated were filtered out, as these provide no information about differential methylation. The resulting sample size after filtering was 84,999.
The glmFit function was used to fit a negative binomial generalized log-linear model. The experimental design matrix was constructed using modelMatrixMeth with a factorial experimental design (~0 + group), where group was a factor variable with levels comprised of each combination of treatment/diet/tissue. We dropped the intercept from our model to parameterize it as a means model. Subsequently, the glmLRT function was used to find differentially methylated loci for comparisons of interest, which were made by constructing contrast vectors. Individual CpG sites were considered differentially methylated if the nominal p-value was < 0.01. Results were filtered based on this nominal p-value threshold.
Results from the statistical analyses have been stored in excel files (and sent to you via email), with files being named by research question. Note that the excel files entitled both_HSM_and SMV.xlsx, SMV_effect_only.xlsx, and HSMV_effect_only.xlsx are the overlap in results lists from the HSMV normal vs HSMV ischemic and SMV normal vs. SMV ischemic comparisons.both_HSM_and SMV.xlsx includes loci for which there was an effect for tissue type in both high fat and normal diets; SMV_effect_only.xlsx includes loci for which there was an effect for tissue type in normal diet only (and not in high fat); and conversely, HSMV_effect_only.xlsx includes loci for which there was an effect of tissue type in high fat diet only (and not in normal diet).
The results files contain the following columns:
“loci column is the location of a given genomic location. The first number or letter is the chromosome, followed by a - symbol and a second number indicates the position or location on the chromosome.”
“logFC indicates the log2-fold change of expression between the two conditions tested. For SMVisch_vs_SMVnormal_results_filtered results, positive values indicate higher methylation rates in SMVnormal relative to SMVisch, negative values indicate lower methylation rates in SMVnormal vs SMVisch. For HSMVisch_vs_HSMVnormal_results_filtered results, positive values indicate higher methylation rates in HSMVnormal relative to HSMVisch, negative values indicate lower methylation rates in HSMVnormal relative to HSMVisch. For diffisch_vs_diffnormal_results_filtered results, positive values indicate that the difference due to normal vs ischemic is greater in SMV than HSMV. For MVM_vs_SMVischemic_results_filtered results, positive values indicate higher methylation rates in MVM relative to SMVischemic, negative values indicate lower methylation rates in MVM relative to SMVischemic.”
“logCPM is the average log2-counts per million, the average taken over all libraries used in the experiment.”
“LR is the likelihood ratio statistics (larger LR, smaller p-value)”
“PValue probability of obtaining results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct”
“BH is Benjamini & Hochberg corrected p-values”
“bonferroni Bonferroni corrected p-values”
“The columns with the format #-Me or #-Un indicate the methylated and unmethylated counts at a given locus for a particular sample.”
“Chr is the chromosome of the locus”
“Locus is the position of the locus on the chromosome indicated in the Chr column.”
“The columns with the format methylated_proportion_# indicate the proportion of methylated counts at a given loci for a particular sample, calculated as methylated counts / (unmethylated counts + methylated counts) for each sample and loci.”
“TSS_start indicates the TSS location nearest to the methylation loci shown.”
“uniprot_gn_symbol is the gene symbol (if available) associated with each TSS”
“ensembl_gene_id is the Ensembl gene ID associated with each TSS”
“gene_start and gene_end are the start and end genomic positions for each gene.”